ABSTRACT

A method and apparatus for finding the best or near best binary classification of a set of observed events, according to a predictor feature X so as to minimize the uncertainty in the value of a category feature Y. Each feature has three or more possible values. First, the predictor feature value and the category feature value of each event are measured. The events are then split, arbitrarily, into two sets of predictor feature values. From the two sets of predictor feature values, an optimum pair of sets of category feature values is found having the lowest uncertainty in the value of the predictor feature. From the two optimum sets of category feature values, an optimum pair of sets of predictor feature values is found having the lowest uncertainty in the value of the category feature. An event is then classified according to whether its predictor feature value is a member of a set of optimal predictor feature values.

2 Claims, 4 Drawing Sheets 
METHOD AND APPARATUS FOR FINDING THE BEST SPLITS IN A DECISION TREE FOR A LANGUAGE MODEL FOR A SPEECH RECOGNIZER

BACKGROUND OF THE INVENTION

The invention relates to the binary classification of observed events for constructing decision trees for use in pattern recognition systems. The observed events may be, for example, spoken words or written characters. Other observed events which may be classified according to the invention include, but are not limited to, medical symptoms and radar patterns of objects.

More specifically, the invention relates to finding an optimal or near optimal binary classification of a set of observed events (i.e. a training set of observed events), for the creation of a binary decision tree. In a binary decision tree, each node of the tree has one input path and two output paths (e.g. Output Path A and Output Path B). At each node of the tree, a question is asked of the form "Is X an element of the set SX?" If the answer is "Yes", then Output Path A is followed from the node. If the answer is "No", then Output Path B is followed from the node.

In general, each observed event to be classified has a predictor feature X and a category feature Y. The predictor feature has one of M different possible values Xm, and the category feature has one of N possible values Yn, where m and n are positive integers less than or equal to M and N, respectively.

In constructing binary decision trees, it is advantageous to find the subset SXopt of SX for which the information regarding the category feature Y is maximized, and the uncertainty in the category feature Y is minimized. The answer to the question "For an observed event to be classified, is the value of the predictor feature X an element of the subset SXopt?" will then give, on average over a plurality of observed events, the maximum reduction in the uncertainty about the value of the category feature Y.

One known method of finding the best subset SXopt for minimizing the uncertainty in the category feature Y is by enumerating all subsets of SX, and by calculating the information or the uncertainty in the value of the category feature Y for each subset. For a set SX having M elements, there are 2^(M-1) - 1 different possible subsets. Therefore, 2^(M-1) - 1 information or uncertainty computations would be required to find the best subset SX. This method, therefore, is not practically possible for large values of M, such as would be encountered in automatic speech recognition.

Leo Breiman et al (Classification And Regression Trees, Wadsworth Inc., Monterey, Calif., 1984, pages 101-102) describe a method of finding the best subset SXopt for the special case where the category feature Y has only two possible different values Y1 and Y2 (that is, for the special case where N=2). In this method, the complete enumeration requiring 2^(M-1) - 1 information computations can be replaced by information computations for only (M-1) subsets.

The Breiman et al method is based upon the fact that the best subset SXopt of the set SX is among the increasing sequence of subsets defined by ordering the predictor feature values Xm according to the increasing value of the conditional probability of one value Y1 of the category feature Y for each value Xm of the predictor feature X.

That is, the predictor feature values Xm are first ordered, in the Breiman et al method, according to the values of P(Y1|Xm). Next, a number of subsets SXi of set SX are defined, where each subset SXi contains only those values Xm having the i lowest values of the conditional probability P(Y1|Xm). When there are M different values of Xm, there will be (M-1) subsets SXi.

As described by Breiman et al, one of the subsets SXi maximizes the information, and therefore minimizes the uncertainty in the value of the class Y.

While Breiman et al provided a shortened method for finding the best binary classification of events according to a measurement or predictor variable X for the simple case where the events fall into only two classes or categories, Breiman et al do not describe how to find the best subset of the predictor variable X when the events fall into more than two classes or categories.

SUMMARY OF THE INVENTION

It is an object of the invention to provide a method and apparatus for finding the optimum or near optimum binary classification of a set of observed events according to a predictor feature X having three or more different possible values Xm, so as to minimize the uncertainty in the value of the category feature Y of the observed events, where the category feature Y has three or more possible values Yn.

It is another object of the invention to provide such a classification method in which the optimum or near optimum classification of the predictor feature X can be found without enumerating all of the possible different subsets of the values of the predictor feature X, and without calculating the uncertainty in the value of the category feature Y for each of the subsets.

The invention is a method and apparatus for classifying a set of observed events. Each event has a predictor feature X and a category feature Y. The predictor feature has one of M different possible values Xm, and the category feature has one of N possible values Yn. In the method, M and N are integers greater than or equal to 3, m is an integer from 1 to M, and n is an integer from 1 to N.

According to the invention, the predictor feature value Xm and the category feature value Yn of each event in the set of observed events are measured. From the measured predictor feature values and the measured category feature values, the probability P(Xm, Yn) of occurrence of an event having a category feature value Yn and a predictor feature value Xm is estimated for each Yn and each Xm.

Next, a starting set SXopt(t) of predictor feature values Xm is selected. The variable t has any initial value.

From the estimated probabilities, the conditional probability P(SXopt(t)|Yn) that the predictor feature has a value in the set SXopt(t) when the category feature has a value Yn is calculated for each Yn.

From the conditional probabilities P(SXopt(t)|Yn) a number of pairs of sets SYj(t) and S̄Yj(t) of category feature values Yn are defined, where j is an integer from 1 to (N-1). Each set SYj(t) contains only those category feature values Yn having the j lowest values of the conditional probability P(SXopt(t)|Yn). Each set S̄Yj(t) contains only those category feature values Yn having the (N-j) highest values of the conditional probability. Thus, S̄Yj(t) is the complement of SYj(t).
From the (N-1) pairs of sets defined above, a single pair of sets SYopt(t) and S̄Yopt(t) is found having the lowest uncertainty in the value of the predictor feature.

Now, from the estimated event probabilities P(Xm, Yn), the conditional probability P(SYopt(t)|Xm) that the category feature has a value in the set SYopt(t) when the predictor feature has a value Xm is calculated for each value Xm. A number of pairs of sets SXi(t+1) and S̄Xi(t+1) of predictor feature values Xm are defined, where i is an integer from 1 to (M-1). Each set SXi(t+1) contains only those predictor feature values Xm having the i lowest values of the conditional probability P(SYopt(t)|Xm). Each set S̄Xi(t+1) contains only those predictor feature values Xm having the (M-i) highest values of the conditional probability. There are (M-1) pairs of such sets.

From the (M-1) pairs of sets defined above, a single pair of sets SXopt(t+1) and S̄Xopt(t+1) is found having the lowest uncertainty in the value of the category feature.

Thereafter, an event is classified in a first class if the predictor feature value of the event is a member of the set SXopt(t+1). An event is classified in a second class if the predictor feature value of the event is a member of the set S̄Xopt(t+1).

In one aspect of the invention, an event is classified in the first class or the second class by producing a classification signal identifying the event as a member of the first class or the second class.

In another aspect of the invention, the method is iteratively repeated until the set SXopt(t+1) is equal or substantially equal to the previous set SXopt(t).

Preferably, the pair of sets having the lowest uncertainty is found by calculating the uncertainty for every pair of sets. Alternatively, the pair of sets having the lowest uncertainty may be found by calculating the uncertainty of each pair of sets in the order of increasing conditional probability. The pair of sets with the lowest uncertainty is found when the calculated uncertainty stops decreasing.

In a further aspect of the invention, the predictor feature value of an event not in the set of events is measured. The event is classified in the first class if the predictor feature value is a member of the set SXopt(t+1), and the event is classified in the second class if the predictor feature value is a member of the set S̄Xopt(t+1).

In the method and apparatus according to the present invention, each event may be, for example, a spoken utterance in a series of spoken utterances. The utterances may be classified according to the present invention, for example, for the purpose of producing models of similar utterances. Alternatively, for example, the utterances may be classified according to the present invention for the purpose of recognizing a spoken utterance.

In another example, each event is a spoken word in a series of spoken words. The predictor feature value of a word comprises an identification of the immediately preceding word in the series of spoken words.

The invention also relates to a method and apparatus for automatically recognizing a spoken utterance. A predictor word signal representing an uttered predictor word is compared with predictor feature signals in a decision set. If the predictor word signal is a member of the decision set, a first predicted word is output, otherwise a second predicted word is output.

The method and apparatus according to the invention are advantageous because they can identify, for a set of observed events, the subset of predictor feature values of the events which minimizes or nearly minimizes the uncertainty in the value of the category feature of the events, where the category feature has three or more possible values. The invention finds this subset of predictor feature values in an efficient manner without requiring a complete enumeration of all possible subsets of predictor feature values.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is a flow chart of the method of classifying a set of observed events according to the present invention.

FIG. 2 is a flow chart showing how the sets with the lowest uncertainty are found.

FIG. 3 is a block diagram of an apparatus for classifying a set of observed events according to the present invention.

FIG. 4 shows a decision tree which may be constructed from the classifications produced by the method and apparatus according to the present invention.

FIG. 5 is a block diagram of a speech recognition system containing a word context match according to the present invention.

FIG. 6 is a block diagram of a portion of the word context match of FIG. 5.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 is a flow chart of the method of classifying a set of observed events according to the present invention. Each event in the set of observed events has a predictor feature X and a category feature Y. The predictor feature has one of M different possible values Xm. The category feature has one of N possible values Yn. M and N are positive integers greater than or equal to 3, and need not be equal to each other. The variable m is a positive integer less than or equal to M. The variable n is a positive integer less than or equal to N.

Each event in the set of observed events may be, for example, a sequence of uttered words. For example, each event may comprise a prior word, and a recognition word following the prior word. If the predictor feature is the prior word, and the category feature is the recognition word, and if it is desired to classify the events by finding the best subset of prior words which minimizes the uncertainty in the value of the recognition word, then the classification method according to the present invention can be used to identify one or more candidate values of the recognition word when the prior word is known.

Other types of observed events having predictor feature/category features which can be classified by the method according to the present invention include, for example, medical symptoms/illnesses, radar patterns/objects, and visual patterns/characters. Other observed events can also be classified by the method of the present invention.

For the purpose of explaining the operation of the invention, the observed events will be described as prior word/recognition word events. In practice, there may be thousands of different words in a language model, and hence the predictor feature of an event will have one of thousands of different available predictor feature values. Similarly, the category feature of an event will have one of thousands of different available category feature values. In such a case, to find the best subset SXopt of the predictor feature values which minimizes the uncertainty in the value of the category feature would require the enumeration of 2^(M-1) - 1 subsets. For values of M in the thousands, this is not practical.

According to the present invention, it is only necessary to enumerate at most approximately K(M+N) subsets to find the best subset SXopt, as described below. K is a constant, for example 6.
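
To make the difference in scale concrete, the two counts quoted above can be compared directly. The following is a minimal sketch; the function names are hypothetical, and K = 6 is simply the example constant mentioned in the text.

```python
def exhaustive_split_count(m: int) -> int:
    """Number of subsets examined by full enumeration: 2^(M-1) - 1."""
    return 2 ** (m - 1) - 1


def iterative_split_bound(m: int, n: int, k: int = 6) -> int:
    """Approximate bound K(M + N) from the text; K = 6 is the example constant."""
    return k * (m + n)


if __name__ == "__main__":
    for m, n in [(5, 5), (20, 20), (5000, 5000)]:
        # The exhaustive count is reported by its number of decimal digits,
        # because the value itself is far too long to print for large M.
        digits = len(str(exhaustive_split_count(m)))
        print(f"M={m}, N={n}: exhaustive count has {digits} decimal digits, "
              f"K(M+N) = {iterative_split_bound(m, n)}")
```

For very small M (such as the M = 5 example used below) the exhaustive count is still smaller than K(M+N); the saving only appears once M grows, which is the case of interest for language models.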

While the predictor feature and the category feature may have thousands of different possible values, for the purpose of explaining the invention an example will be described in which M = 5 and N = 5. In general, M need not equal N.

Table 1 shows an example of a predictor feature having five different possible values X1 to X5. The predictor feature (prior word) values are "travel", "credit", "special", "trade", and "business", respectively.

TABLE 1

PREDICTOR FEATURE VALUE    PRIOR WORD (PREDICTOR WORD)
X1                         travel
X2                         credit
X3                         special
X4                         trade
X5                         business

Table 2 shows a category feature having five different possible values Y1 to Y5. The category feature (recognition word) values are "agent", "bulletin", "management", "consultant", and "card", respectively.

TABLE 2

CATEGORY FEATURE VALUE     RECOGNITION WORD (PREDICTED WORD)
Y1                         agent
Y2                         bulletin
Y3                         management
Y4                         consultant
Y5                         card

Returning to FIG. 1, according to the invention the predictor feature value Xm and the category feature value Yn of each event in the set of events are measured. Table 3 shows an example of possible measurements for ten hypothetical events.

TABLE 3

EVENT    PREDICTOR FEATURE VALUE    CATEGORY FEATURE VALUE
1        X5                         Y4
2        X[illegible]               Y[illegible]
3        X[illegible]               Y4
4        X[illegible]               Y1
5        X4                         Y4
6        X2                         Y5
7        X3                         Y[illegible]
8        X5                         Y5
9        X[illegible]               Y4
10       X4                         Y[illegible]

From the measured predictor feature values and the measured category feature values in Table 3, the probability P(Xm, Yn) of occurrence of an event having a category feature value Yn and a predictor feature value Xm is estimated for each Yn and each Xm. If the set of events is sufficiently large, the probability P(Xm, Yn) of occurrence of an event having a category feature value Yn and a predictor feature value Xm can be estimated as the total number of events in the set of observed events having feature values (Xm, Yn) divided by the total number of events in the set of observed events.

Table 4 is an example of hypothetical estimates of the probabilities P(Xm, Yn).

TABLE 4

        X1          X2          X3          X4          X5
Y5      0.030616    0.024998    0.082165    0.014540    0.008761
Y4      0.014505    0.013090    0.068136    0.054667    0.024698
Y3      0.074583    0.048437    0.035092    0.076762    0.000654
Y2      0.042840    0.081654    0.011594    0.055642    0.050448
Y1      0.022882    0.051536    0.032359    0.005319    0.074012

After the probabilities are estimated, a starting set SXopt(t) of predictor feature values Xm is selected. The variable t has any initial value. In our example, we will arbitrarily select the starting set SXopt(t) equal to the predictor feature values X1 and X2 (that is, the prior words "travel" and "credit").

Continuing through the flow chart of FIG. 1, from the estimated probabilities P(Xm, Yn), the conditional probability P(SXopt(t)|Yn) that the predictor feature has a value in the set SXopt(t) when the category feature has a value Yn is calculated for each Yn. Table 5 shows the joint probabilities of SXopt(t) and Yn, the joint probabilities of S̄Xopt(t) and Yn, and the conditional probability of SXopt(t) given Yn for each value of the category feature Y, and for t equal 1. These numbers are based on the probabilities shown in Table 4, where

$$P(SX_{opt}(t), Y_n) = P(X_1, Y_n) + P(X_2, Y_n),$$

$$P(\overline{SX}_{opt}(t), Y_n) = P(X_3, Y_n) + P(X_4, Y_n) + P(X_5, Y_n),$$

and

$$P(SX_{opt}(t) \mid Y_n) = \frac{P(SX_{opt}(t), Y_n)}{P(SX_{opt}(t), Y_n) + P(\overline{SX}_{opt}(t), Y_n)}.$$

TABLE 5

        P(SXopt(t), Yn)    P(S̄Xopt(t), Yn)    P(SXopt(t)|Yn)
Y5      0.055614           0.105466            0.345258
Y4      0.027596           0.147501            0.157604
Y3      0.123021           0.112509            0.522314
Y2      0.124494           0.117686            0.514056
Y1      0.074418           0.111691            0.399862

SXopt(t) = {X1, X2}

After the conditional probabilities are calculated, a number of ordered pairs of sets SYj(t) and S̄Yj(t) of category feature values Yn are defined. The number j of sets is a positive integer less than or equal to (N-1). Each set SYj(t) contains only those category feature values Yn having the j lowest values of the conditional probability P(SXopt(t)|Yn). Each set S̄Yj(t) contains only those category feature values Yn having the (N-j) highest values of the conditional probability.
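
The conditional probabilities that drive this ordering follow directly from the joint estimates. The sketch below is illustrative only: the function name, the dictionary representation of the joint table, and the small three-by-three joint distribution are all assumptions made for the example, not values from the tables.

```python
from typing import Dict, Set, Tuple

Joint = Dict[Tuple[str, str], float]  # (x, y) -> estimated P(x, y)


def conditional_set_given_y(joint: Joint, sx: Set[str]) -> Dict[str, float]:
    """Return P(X in sx | Y = y) for every y with non-zero probability."""
    inside: Dict[str, float] = {}  # P(X in sx, Y = y)
    total: Dict[str, float] = {}   # P(Y = y)
    for (x, y), p in joint.items():
        total[y] = total.get(y, 0.0) + p
        if x in sx:
            inside[y] = inside.get(y, 0.0) + p
    return {y: inside.get(y, 0.0) / total[y] for y in total if total[y] > 0.0}


if __name__ == "__main__":
    # Hypothetical joint distribution (sums to 1); not taken from Table 4.
    joint = {
        ("x1", "y1"): 0.10, ("x2", "y1"): 0.10, ("x3", "y1"): 0.20,
        ("x1", "y2"): 0.05, ("x2", "y2"): 0.25, ("x3", "y2"): 0.00,
        ("x1", "y3"): 0.15, ("x2", "y3"): 0.05, ("x3", "y3"): 0.10,
    }
    print(conditional_set_given_y(joint, {"x1", "x2"}))
```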

From the conditional probabilities of Table 5, Table 6 shows the ordered pairs of sets SY1(t) and S̄Y1(t) to SY4(t) and S̄Y4(t). Table 6 also shows the category feature values Yn in each set.
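
The construction of the ordered pairs can be sketched as a sort followed by taking prefixes. The function name below is hypothetical. The conditional probabilities are those of Table 5, with the value for Y3 taken as 0.522314, which is an assumption: it is the ratio implied by the two joint columns of Table 5, 0.123021 / (0.123021 + 0.112509).

```python
from typing import Dict, List, Tuple


def ordered_pairs(cond: Dict[str, float]) -> List[Tuple[List[str], List[str]]]:
    """Sort values by increasing conditional probability and return the
    (N - 1) pairs (j lowest values, remaining N - j values)."""
    ranked = sorted(cond, key=lambda value: cond[value])
    return [(ranked[:j], ranked[j:]) for j in range(1, len(ranked))]


if __name__ == "__main__":
    # P(SXopt(t) | Yn) as read from Table 5 (Y3 value assumed, see above).
    cond_table5 = {
        "Y5": 0.345258,
        "Y4": 0.157604,
        "Y3": 0.522314,
        "Y2": 0.514056,
        "Y1": 0.399862,
    }
    for j, (low, high) in enumerate(ordered_pairs(cond_table5), start=1):
        print(f"SY{j}(t) = {low}, complement = {high}")
```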
TABLE 6

SET         CATEGORY FEATURE VALUES    UNCERTAINTY (From TABLE 4)    UNCERTAINTY (From TABLE 5)
SY1(t)      Y4                         2.258366                      0.930693
S̄Y1(t)     Y5, Y1, Y2, Y3
SY2(t)      Y4, Y5                     2.204743                      0.934793
S̄Y2(t)     Y1, Y2, Y3
SY3(t)      Y4, Y5, Y1                 2.207630                      0.938692
S̄Y3(t)     Y2, Y3
SY4(t)      Y4, Y5, Y1, Y2             2.214523                      0.961389
S̄Y4(t)     Y3

SYopt(t) = SY2(t) = {Y4, Y5}
S̄Yopt(t) = S̄Y2(t) = {Y1, Y2, Y3}
t = 1

For each of the (N-1) ordered pairs of sets SYj and S̄Yj, the uncertainty in the value of the predictor feature Xm may be calculated in two different ways. Preferably, the uncertainty is calculated from the probabilities of Table 4. Alternatively, uncertainty may also be calculated from the probabilities of Table 5. In general, an uncertainty calculation based upon the individual probabilities of Table 4 will be more accurate than an uncertainty calculation based upon the sets SXopt(t) and S̄Xopt(t) probabilities of Table 5. However, calculating the uncertainty in the reduced Table 5 is faster, and may lead to the same result.

In Table 6, the uncertainty H(Split SYj(t)) in the value of the predictor feature X for the sets SYj(t) and S̄Yj(t) was calculated from the Table 4 probabilities according to the formula

$$H(\mathrm{Split}\ SY_j(t)) = P(SY_j(t))\,H(X_m \mid SY_j(t)) + P(\overline{SY}_j(t))\,H(X_m \mid \overline{SY}_j(t)),$$

where

$$H(X_m \mid SY_j(t)) = -\sum_{m=1}^{M} P(X_m \mid SY_j(t)) \log P(X_m \mid SY_j(t)),$$

$$P(X_m \mid SY_j(t)) = \frac{P(X_m, SY_j(t))}{P(SY_j(t))},$$

$$P(SY_j(t)) = \sum_{m=1}^{M} P(X_m, SY_j(t)),$$

and

$$P(X_m, SY_j(t)) = \sum_{Y_n \in SY_j(t)} P(X_m, Y_n).$$

In Table 6, the base 2 logarithm was used. (See, for example, Encyclopedia of Statistical Sciences, Volume 2, John Wiley & Sons, 1982, pages 512-516.)

Thus, the uncertainty in the value of a feature for a pair of sets was calculated as the sum of the products of the probability of occurrence of an event in the set, for example P(SYj(t)), times the uncertainty, for example H(Xm|SYj(t)), for that set.

As shown in Table 6, the pair of sets SYopt(t) and S̄Yopt(t) having the lowest uncertainty in the value of the predictor feature X is the pair of sets SY2(t) and S̄Y2(t), as calculated from the probabilities of Table 4.

From the estimated probabilities P(Xm, Yn), new conditional probabilities P(SYopt(t)|Xm) that the category feature has a value in the set SYopt(t) when the predictor feature has a value Xm are calculated for each Xm. Table 7 shows these conditional probabilities, based on the joint probabilities of Table 4, where

$$P(SY_{opt}(t), X_m) = P(Y_4, X_m) + P(Y_5, X_m),$$

$$P(\overline{SY}_{opt}(t), X_m) = P(Y_1, X_m) + P(Y_2, X_m) + P(Y_3, X_m),$$

and

$$P(SY_{opt}(t) \mid X_m) = \frac{P(SY_{opt}(t), X_m)}{P(SY_{opt}(t), X_m) + P(\overline{SY}_{opt}(t), X_m)}.$$

TABLE 7

        P(SYopt(t), Xm)    P(S̄Yopt(t), Xm)    P(SYopt(t)|Xm)
X5      0.033459           0.125115            0.211002
X4      0.069207           0.137724            0.334445
X3      0.150301           0.079046            0.655340
X2      0.038088           0.181628            0.173353
X1      0.045122           0.140305            0.243340

SYopt(t) = {Y4, Y5}
S̄Yopt(t) = {Y1, Y2, Y3}

Next, ordered pairs of sets SXi(t+1) and S̄Xi(t+1) of predictor feature values Xm are defined. The variable i is a positive integer less than or equal to (M-1). Each set SXi(t+1) contains only those predictor feature values Xm having the i lowest values of the conditional probability P(SYopt(t)|Xm). Each set S̄Xi(t+1) contains only those predictor feature values Xm having the (M-i) highest values of the conditional probability.

Table 8 shows the ordered pairs of sets SX1(t+1) and S̄X1(t+1) to SX4(t+1) and S̄X4(t+1) of predictor feature values based upon the conditional probabilities of Table 7. Table 8 also shows the predictor feature values associated with each set.

TABLE 8

SET           PREDICTOR FEATURE VALUES    UNCERTAINTY (From TABLE 4)    UNCERTAINTY (From TABLE 7)
SX1(t+1)      X2                          2.264894                      0.894833
S̄X1(t+1)     X5, X1, X4, X3
SX2(t+1)      X2, X5                      2.181004                      0.876430
S̄X2(t+1)     X1, X4, X3
SX3(t+1)      X2, X5, X1                  2.201608                      0.850963
S̄X3(t+1)     X4, X3
SX4(t+1)      X2, X5, X1, X4              2.189947                      0.827339
S̄X4(t+1)     X3

SXopt(t+1) = SX2(t+1) = {X2, X5}
S̄Xopt(t+1) = S̄X2(t+1) = {X1, X4, X3}
t = 1

In a similar manner as described above with respect to Table 6, the uncertainties in the values of the category feature were calculated for each pair of sets from the probabilities in Tables 4 and 7, respectively. As shown in Table 8, the pair of sets SXopt(t+1) and S̄Xopt(t+1) having the lowest uncertainty in the value of the category feature was SX2(t+1) and S̄X2(t+1), where t=1, as calculated from the probabilities of Table 4.
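
The uncertainty of a split, computed from the full joint table as preferred in the text, is the probability-weighted sum of the base-2 entropies on the two sides of the split. The following is a minimal sketch for a split of the predictor values; the function names and the tiny joint distribution are hypothetical, chosen so that the result is easy to check by hand.

```python
import math
from typing import Dict, Iterable, Set, Tuple

Joint = Dict[Tuple[str, str], float]  # (x, y) -> estimated P(x, y)


def entropy(probs: Iterable[float]) -> float:
    """Base-2 entropy of a normalised distribution (zero terms are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def split_uncertainty_of_y(joint: Joint, sx: Set[str]) -> float:
    """P(SX) * H(Y | SX) + P(complement) * H(Y | complement)."""
    sides: Tuple[Dict[str, float], Dict[str, float]] = ({}, {})
    for (x, y), p in joint.items():
        side = sides[0] if x in sx else sides[1]
        side[y] = side.get(y, 0.0) + p
    total = 0.0
    for side in sides:
        mass = sum(side.values())  # probability of landing on this side
        if mass > 0.0:
            total += mass * entropy(p / mass for p in side.values())
    return total


if __name__ == "__main__":
    # Hypothetical joint distribution in which X fully determines Y.
    joint = {("a", "y1"): 0.5, ("b", "y2"): 0.25, ("c", "y3"): 0.25}
    print(split_uncertainty_of_y(joint, {"a"}))       # one side is pure
    print(split_uncertainty_of_y(joint, {"a", "b"}))  # a less informative split
```

Swapping the roles of x and y in the loop gives the corresponding uncertainty in the predictor feature for a split of the category values, as used for Table 6.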

Finally, having found the pair of sets SXopt(t+1) and S̄Xopt(t+1), an event is classified in a first class if the predictor feature value of the event is a member of the set SXopt(t+1). If the predictor feature value of an event is a member of the set S̄Xopt(t+1), then the event is classified in a second class.

While it is possible, according to the present invention, to classify the set of observed events according to the first values of SXopt(t+1) and S̄Xopt(t+1) obtained, better subsets can be obtained by repeating the process. That is, after finding a pair of sets SXopt(t+1) and S̄Xopt(t+1), but prior to classifying an event, it is preferred to increment t by 1, and then repeat the steps from calculating the conditional probabilities P(SXopt(t)|Yn) through finding SXopt(t+1) and S̄Xopt(t+1). These steps may be repeated, for example, until the set SXopt(t+1) is equal or substantially equal to the previous set SXopt(t). Alternatively, these steps may be repeated a selected number of times.

The present invention was used to classify sets of events based on randomly generated probability distributions by repeating the steps described above, and by stopping the repetition when SXopt(t+1) = SXopt(t). The classification was repeated 100 times each for values of M from 2 to 12 and values of N from 2 to 16. The classification was completed in all cases after not more than 5 iterations, so that fewer than 5(M+N) subsets were examined. The information (uncertainty) in the classifications obtained by the invention was close to the optimum.

FIG. 2 is a flow chart showing one manner of finding the sets SYopt(t) and S̄Yopt(t) having the lowest uncertainty in the value of the predictor feature. After the ordered pairs of sets SYj and S̄Yj are defined, the variable j is set equal to 1. Next, the uncertainty H(Split SYj(t)) is calculated for sets SYj(t) and S̄Yj(t), and the uncertainty H(Split SY(j+1)(t)) is calculated for the sets SY(j+1)(t) and S̄Y(j+1)(t). The uncertainty H(Split SYj(t)) is then compared to the uncertainty H(Split SY(j+1)(t)). If H(Split SYj(t)) is less than H(Split SY(j+1)(t)) then SYopt(t) is set equal to SYj(t), and S̄Yopt(t) is set equal to S̄Yj(t). If H(Split SYj(t)) is not less than H(Split SY(j+1)(t)), then j is incremented by 1, and the uncertainties are calculated for the new value of j.

The sets SXopt(t+1) and S̄Xopt(t+1) having the lowest uncertainty in the value of the category feature can be found in the same manner.

FIG. 3 schematically shows an apparatus for classifying a set of observed events. The apparatus may comprise, for example, an appropriately programmed computer system. In this example, the apparatus comprises a general purpose digital processor 10 having a data entry keyboard 12, a display 14, a random access memory 16, and a storage device 18. Under the control of a program stored in the random access memory 16, the processor 10 retrieves the predictor feature values Xm and the category feature values Yn from the training text in storage device 18. From the feature values of the training text, processor 10 calculates estimated probabilities P(Xm, Yn) and stores the estimated probabilities in storage device 18.

Next, processor 10 selects a starting set SXopt(t) of predictor feature values. For example, the set SXopt(t), where t has an initial value, may include the predictor feature values X1 through X(M/2) (when M is an even integer), or X1 through X((M+1)/2) (when M is an odd integer). Any other selection is acceptable.

From the probabilities P(Xm, Yn) in storage device 18, processor 10 calculates the conditional probabilities P(SXopt(t)|Yn) and stores them in storage device 18. From the conditional probabilities, ordered pairs of sets SYj(t) and S̄Yj(t) are defined and stored in storage device 18.

Still under the control of the program, the processor 10 calculates the uncertainties of the (N-1) ordered pairs of sets, and stores the sets SYopt(t) and S̄Yopt(t) having the lowest uncertainty in the value of the predictor feature in the storage device 18.

In a similar manner, processor 10 finds and stores the sets SXopt(t+1) and S̄Xopt(t+1) having the lowest uncertainty in the value of the category feature.

FIG. 4 shows an example of a decision tree which can be generated by the method and apparatus according to the present invention. As shown in FIG. 4, each node of the decision tree is associated with a set SZ0 through SZ6 of events. The set SZ0 at the top of the tree includes all values of all predictor features. The events in set SZ0 are classified, in the second level of the tree, according to a first predictor feature X', by using the method and apparatus of the present invention.

For example, if the first predictor feature X' is the word which immediately precedes the word to be recognized Y, then the set of events SZ0 is split into a set of events SZ1, in which the prior word is a member of set SX'opt, and a set SZ2, in which the prior word is a member of the set S̄X'opt.

At the third level of the tree, the sets of events SZ1 and SZ2 are further split, for example, by a second predictor value X'' (the word next preceding the word to be recognized). Each node of the decision tree has associated with it a probability distribution for the value of the word to be recognized (the value of the category feature). As one proceeds through the decision tree, the uncertainty in the value of the word to be recognized Y is successively reduced.

FIG. 5 is a block diagram of an automatic speech recognition system which utilizes the classification method and apparatus according to the present invention. A similar system is described in, for example, U.S. Pat. No. 4,759,068. The system shown in FIG. 5 includes a microphone 20 for converting an utterance into an electrical signal. The signal from the microphone is processed by an acoustic processor and label match 22 which finds the best-matched acoustic label prototype from the acoustic label prototype store 24. A fast acoustic word match processor 26 matches the label string from acoustic processor 22 against abridged acoustic word models in store 28 to produce an utterance signal.

The utterance signal output by the fast acoustic word match processor comprises at least one predictor word signal representing a predictor word of the utterance. In general, however, the fast acoustic match processor will output a number of candidate predictor words.

Each predictor word signal produced by the fast acoustic word match processor 26 is input into a word context match 30 which compares the word context to language models in store 32 and outputs at least one category feature signal representing a candidate predicted word. From the recognition candidates produced by the fast acoustic match and the language model, the detailed acoustic match 34 matches the label string from acoustic processor 22 against detailed acoustic word models in store 36 and outputs a word string corresponding to the utterance.

FIG. 6 is a more detailed block diagram of a portion of the word context match 30 and language model 32. The word context match 30 and language model 32 include predictor feature signal storage 38 for storing all the predictor words. A decision set generator 40 (such





5,263,117 






as the apparatus described above with respect to FIG. 
3) generates a subset of the predictor words for the 
decision set. 

A controller 42 directs the storage of predictor word signals from the fast acoustic word match processor 26 in predictor word signal storage 44.

A predictor word signal addressed by controller 42 is compared with a predictor feature signal addressed by controller 42 in comparator 46. After the predictor word signal is compared with the decision set, a first category feature signal is output if the predictor word signal is a member of the decision set. Otherwise, a second category feature signal is output.
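
In software terms, the comparison amounts to a set-membership test. The sketch below is only an analogy to the comparator; the function name and the two output labels are hypothetical, and the decision set shown is the pair of prior words corresponding to X2 and X5 in Table 1.

```python
from typing import FrozenSet


def category_feature_signal(predictor_word: str,
                            decision_set: FrozenSet[str],
                            first_signal: str,
                            second_signal: str) -> str:
    """Return the first signal if the word is in the decision set, else the second."""
    return first_signal if predictor_word in decision_set else second_signal


if __name__ == "__main__":
    decision_set = frozenset({"credit", "business"})  # X2 and X5 in Table 1
    for word in ("credit", "travel"):
        print(word, "->", category_feature_signal(word, decision_set, "first", "second"))
```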

An example of the operation of the word context match 30 and language model 32 can be explained with reference to Tables 1, 2, and 9.

TABLE 9

        P(Yn, SXopt(t+1))    P(Yn, S̄Xopt(t+1))    P(Yn|SXopt(t+1))    P(Yn|S̄Xopt(t+1))
Y5      0.033759             0.127322              0.089242            0.204794
Y4      0.037788             0.137308              0.099893            0.220856
Y3      0.049092             0.186437              0.129774            0.299880
Y2      0.132102             0.110078              0.349208            0.177057
Y1      0.125548             0.060561              0.331881            0.097411

SXopt(t+1) = {X2, X5}
S̄Xopt(t+1) = {X1, X4, X3}
t = 1

Yn, where N is an integer greater than or equal to three, and n is an integer greater than zero and less than or equal to N; and

outputting a second category feature signal, different from the first category feature signal and representing a second predicted word different from the first predicted word if the predictor word signal is not a member of the decision set;

characterized in that the contents of the decision set are generated by the steps of:

providing a training text comprising a set of observed events, each event having a predictor feature X representing a predictor word and a category feature Y representing a predicted word, said predictor feature having one of M different possible values Xm, each Xm representing a different predictor



In an utterance of a siring of words, a prior word 
(that is, a word immediately preceding the word to be 
recognized) is tentatively identified as the word 30 
"travel". According to Table 1, the prior word **travel" 
has a predictor feature value Xi. From Table 9, the 
word "travel" is in set SXcp^x-h 1). Therefore, the prob- 
ability of each value Yn of the word to be recognized is 
given by the conditional probability P(Yn|53CflipXt-*-l) 35 
in Table 9. 

If we select the words with the first and second high- 
est conditional probabilities, then the language model 26 
will output words Y3 and Y4 ("management" and "con- 
sultant") as candidates for the recognition word follow- 40 
ing *travel". The candidates will then be presented to 
the detailed acoustic match for further examination. 

We claim: 

1. A method of automatic speech recognition com- 
prising the steps of: 45 

converting an utterance into an utterance signal rep- 
resenting the utterance, said utterance comprising a 
series of at least a predictor word and a predicted 
word, said utterance signal comprising at least one 
predictor word signal representing the predictor 50 
word; 

providing a set of M predictor feature signals, each 
predictor feature signal having a predictor feature 
value Xirit where M is an integer greater than or 
equal to three and m is an integer greater than zero 55 
and lens than or equal to M, each predictor feature 
signal in the set representing a different word; 

generating a decision set which contains a subset of 
the M predictor feature signals representing the 
words; 60 

comparing the predictor word signal with the predic- 
tor feature signals in the decision set; 

outputting a first category feature signd representing 
a first predicted word if the predictor word signal 
is a member of the decision set, said first category 63 
feature signal being one of N category feature sig- 
nals, each category feature signal representing a 
different word and having a category feature value 



word, said category feature having one of N possi- 
ble values Yn, each Yn representing a different 
predicted word; 

(a) measuring the predictor feature value Xm and the 
category feature value Yn of each event in the set of 
events; 

(b) estimating, from the measured predictor feature
values and the measured category feature values,
the probability P(Xm, Yn) of occurrence of an
event having a category feature value Yn and a
predictor feature value Xm, for each Yn and each
Xm;

(c) selecting a starting set SXopt(t) of predictor feature
values Xm, where t has an initial value;

(d) calculating, from the estimated probabilities
P(Xm, Yn), the conditional probability
P(SXopt(t)|Yn) that the predictor feature has a value
in the set SXopt(t) when the category feature has a
value Yn, for each Yn;

(e) defining a number of pairs of sets SYj(t) and S̄Ȳj(t)
of category feature values Yn, where j is an integer
greater than zero and less than or equal to (N-1),
each set SYj(t) containing only those category
feature values Yn having the j lowest values of
P(SXopt(t)|Yn), each set S̄Ȳj(t) containing only
those category feature values Yn having the (N-j)
highest values of P(SXopt(t)|Yn);

(f) finding a pair of sets SYopt(t) and S̄Ȳopt(t) from
among the pairs of sets SYj(t) and S̄Ȳj(t) such that
the pair of sets SYopt(t) and S̄Ȳopt(t) have the lowest
uncertainty in the value of the predictor feature;

(g) calculating, from the estimated probabilities
P(Xm, Yn), the conditional probability
P(SYopt(t)|Xm) that the category feature has a
value in the set SYopt(t) when the predictor feature
has a value Xm, for each Xm;

(h) defining a number of pairs of sets SXi(t+1) and
S̄X̄i(t+1) of predictor feature values Xm, where i is
an integer greater than zero and less than or equal
to (M-1), each set SXi(t+1) containing only those
predictor feature values Xm having the i lowest
values of P(SYopt(t)|Xm), each set S̄X̄i(t+1)
containing only those predictor feature values Xm
having the (M-i) highest values of
P(SYopt(t)|Xm);

(i) finding a pair of sets SXopt(t+1) and S̄X̄opt(t+1)
from among the pairs of sets SXi(t+1) and
S̄X̄i(t+1) such that the pair of sets SXopt(t+1) and
S̄X̄opt(t+1) have the lowest uncertainty in the
value of the category feature; and

(1) setting the decision set equal to the set SXopt(t+1).
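
Steps (b) through (i) can be read as a single pass of an
alternating search: split the category values using the
current predictor set, then split the predictor values
using the resulting category set. The Python below is an
illustrative sketch only, not the claimed implementation.
It assumes that "uncertainty" is measured as average
conditional entropy and that P(Xm, Yn) is estimated by
relative frequency; the function names and the toy events
are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


def joint_probabilities(events):
    """Step (b): estimate P(x, y) as the relative frequency of each pair."""
    counts = Counter(events)
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}


def best_split(joint, axis, ref_set):
    """Split the values of one feature (axis 0 = X, axis 1 = Y) in two.

    Values are ranked by P(other feature in ref_set | value). Each prefix
    of the ranking (the "lowest" values) is a candidate set; the candidate
    whose two-way split leaves the lowest average conditional entropy of
    the other feature is returned together with that entropy.
    """
    other = 1 - axis
    values = sorted({k[axis] for k in joint})
    marginal = {v: sum(p for k, p in joint.items() if k[axis] == v)
                for v in values}
    in_ref = {v: sum(p for k, p in joint.items()
                     if k[axis] == v and k[other] in ref_set)
              for v in values}
    ranked = sorted(values, key=lambda v: in_ref[v] / marginal[v])
    best, best_h = None, float("inf")
    for cut in range(1, len(ranked)):
        low = set(ranked[:cut])
        h = 0.0
        for side in (low, set(ranked[cut:])):
            p_side = sum(marginal[v] for v in side)
            dist = Counter()
            for k, p in joint.items():
                if k[axis] in side:
                    dist[k[other]] += p
            h += p_side * entropy([p / p_side for p in dist.values()])
        if h < best_h:
            best, best_h = low, h
    return best, best_h


def one_pass(events, sx_start):
    """Steps (b)-(i): return the next predictor set and its uncertainty."""
    joint = joint_probabilities(events)
    sy_opt, _ = best_split(joint, axis=1, ref_set=sx_start)  # steps (d)-(f)
    return best_split(joint, axis=0, ref_set=sy_opt)         # steps (g)-(i)


# Toy events (predictor value, category value), for illustration only.
events = [("a", "p"), ("a", "q"), ("b", "p"), ("b", "q"),
          ("c", "r"), ("c", "s"), ("d", "r"), ("d", "s")]
decision_set, uncertainty = one_pass(events, {"a"})
print(decision_set, uncertainty)
```

With these toy events, starting from the arbitrary set
{"a"}, one pass groups "a" with "b", since both are
followed by the same category values.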

2. An automatic speech recognition system compris- 
ing: 

means for converting an utterance into an utterance
signal representing the utterance, said utterance
comprising a series of at least a predictor word and
a predicted word, said utterance signal comprising
at least one predictor word signal representing the
predictor word;

means for storing a set of M predictor feature signals,
each predictor feature signal having a predictor
feature value Xm, where M is an integer greater
than or equal to three and m is an integer greater
than zero and less than or equal to M, each
predictor feature signal in the set representing a different
word;

means for generating a decision set which contains a
subset of the M predictor feature signals
representing the words;

means for comparing the predictor word signal with 
the predictor feature signals in the decision set; 

means for outputting a first category feature signal
representing a first predicted word if the predictor
word signal is a member of the decision set, said
first category feature signal being one of N
category feature signals, each category feature signal
representing a different word and having a
category feature value Yn, where N is an integer
greater than or equal to three, and n is an integer
greater than zero and less than or equal to N; and

means for outputting a second category feature
signal, different from the first category feature signal
and representing a second predicted word different
from the first predicted word if the predictor word
signal is not a member of the decision set;

characterized in that the means for generating the 
decision set comprises: 

means for storing a training text comprising a set of
observed events, each event having a predictor
feature X representing a predictor word and a
category feature Y representing a predicted word,
said predictor feature having one of M different
possible values Xm, each Xm representing a
different predictor word, said category feature having
one of N possible values Yn, each Yn representing a
different predicted word;

(a) means for measuring the predictor feature value
Xm and the category feature value Yn of each event
in the set of events;

(b) means for estimating, from the measured predictor
feature values and the measured category feature
values, the probability P(Xm, Yn) of occurrence of
an event having a category feature value Yn and a
predictor feature value Xm, for each Yn and each
Xm;

(c) means for selecting a starting set SXopt(t) of
predictor feature values Xm, where t has an initial
value;

(d) means for calculating, from the estimated
probabilities P(Xm, Yn), the conditional probability
P(SXopt(t)|Yn) that the predictor feature has a
value in the set SXopt(t) when the category feature
has a value Yn, for each Yn;

(e) means for defining a number of pairs of sets SYj(t)
and S̄Ȳj(t) of category feature values Yn, where j is
an integer greater than zero and less than or equal
to (N-1), each set SYj(t) containing only those
category feature values Yn having the j lowest
values of P(SXopt(t)|Yn), each set S̄Ȳj(t) containing
only those category feature values Yn having the
(N-j) highest values of P(SXopt(t)|Yn);

(f) means for finding a pair of sets SYopt(t) and S̄Ȳopt(t)
from among the pairs of sets SYj(t) and S̄Ȳj(t) such
that the pair of sets SYopt(t) and S̄Ȳopt(t) have the
lowest uncertainty in the value of the predictor
feature;

(g) means for calculating, from the estimated
probabilities P(Xm, Yn), the conditional probability
P(SYopt(t)|Xm) that the category feature has a
value in the set SYopt(t) when the predictor feature
has a value Xm, for each Xm;

(h) means for defining a number of pairs of sets
SXi(t+1) and S̄X̄i(t+1) of predictor feature values
Xm, where i is an integer greater than zero and less
than or equal to (M-1), each set SXi(t+1)
containing only those predictor feature values Xm having
the i lowest values of P(SYopt(t)|Xm), each set
S̄X̄i(t+1) containing only those predictor feature
values Xm having the (M-i) highest values of
P(SYopt(t)|Xm);

(i) means for finding a pair of sets SXopt(t+1) and
S̄X̄opt(t+1) from among the pairs of sets SXi(t+1)
and S̄X̄i(t+1) such that the pair of sets SXopt(t+1)
and S̄X̄opt(t+1) have the lowest uncertainty in the
value of the category feature; and

(1) means for outputting the set SXopt(t+1) as the
decision set.


